ggplot2 is an R package for producing statistical, or data, graphics, but it is unlike most other graphics packages because it has a deep underlying grammar. This grammar, based on the Grammar of Graphics (Wilkinson, 2005), is made up of a set of independent components that can be composed in many different ways. This makes ggplot2 very powerful because you are not limited to a set of pre-specified graphics, but you can create new graphics that are precisely tailored for your problem. This may sound overwhelming, but because there is a simple set of core principles and very few special cases, ggplot2 is also easy to learn (although it may take a little time to forget your preconceptions from other graphics tools).
Practically, ggplot2 provides beautiful, hassle-free plots that take care of fiddly details like drawing legends. The plots can be built up iteratively and edited later. A carefully chosen set of defaults means that most of the time you can produce a publication-quality graphic in seconds, but if you do have special formatting requirements, a comprehensive theming system makes it easy to do what you want. Instead of spending time making your graph look pretty, you can focus on creating a graph that best reveals the messages in your data.
Wilkinson (2005) created the grammar of graphics to describe the deep features that underlie all statistical graphics. The grammar of graphics is an answer to a question: what is a statistical graphic? The layered grammar of graphics (Wickham, 2010) builds on Wilkinson’s grammar, focussing on the primacy of layers and adapting it for embedding within R. In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset. It is the combination of these independent components that makes up a graphic.
As the book progresses, the formal grammar will be explained in increasing detail. The first description of the components follows below. It introduces some of the terminology that will be used throughout the book and outlines the basic responsibilities of each component. Don’t worry if it doesn’t all make sense right away: you will have many more opportunities to learn about the pieces and how they fit together. All plots are composed of:
Data that you want to visualise and a set of aesthetic mappings describing how variables in the data are mapped to aesthetic attributes that you can perceive.
Layers made up of geometric elements and statistical transformation. Geometric objects, geoms for short, represent what you actually see on the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise data in many useful ways. For example, binning and counting observations to create a histogram, or summarising a 2d relationship with a linear model.
The scales map values in the data space to values in an aesthetic space, whether it be colour, or size, or shape. Scales draw a legend or axes, which provide an inverse mapping to make it possible to read the original data values from the plot.
A coordinate system, coord for short, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to make it possible to read the graph. We normally use a Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
A faceting specification describes how to break up the data into subsets and how to display those subsets as small multiples. This is also known as conditioning or latticing/trellising.
A theme which controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot. A good starting place is Tufte’s early works (Tufte, 1990, 1997, 2001).
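As a rough sketch of how these components show up in code, the example below spells out each one explicitly for the mpg data that ships with ggplot2. The specific choices (the colour gradient, faceting by drv, theme_minimal()) are illustrative assumptions rather than recommendations from the text; in everyday use most of them would be filled in by defaults.

```r
library(ggplot2)

p_components <- ggplot(mpg, aes(x = displ, y = hwy)) +    # data + aesthetic mappings
  geom_point(aes(colour = cty)) +                         # layer: point geom, identity stat
  geom_smooth(method = "lm", formula = y ~ x,             # layer: line geom, smoothing stat
              se = FALSE, colour = "black") +
  scale_colour_gradient(low = "grey70", high = "navy") +  # scale: data values -> colours
  coord_cartesian() +                                     # coordinate system
  facet_wrap(~drv) +                                      # faceting specification
  theme_minimal(base_size = 11)                           # theme

p_components
```

Dropping the scale, coord, facet and theme lines still gives a valid plot, because each of those components has a default.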
The grammar doesn’t suggest what graphics you should use to answer the questions you are interested in. While this book endeavours to promote a sensible process for producing plots of data, the focus of the book is on how to produce the plots you want, not knowing what plots to produce. For more advice on this topic, you may want to consult Robbins (2013), Cleveland (1993), Chambers et al. (1983), and Tukey (1977).
The grammar does not describe interactivity: the grammar of graphics describes only static graphics and there is essentially no benefit to displaying them on a computer screen as opposed to a piece of paper. ggplot2 can only create static graphics, so for dynamic and interactive graphics you will have to look elsewhere (perhaps at ggvis, described below). Cook and Swayne (2007) provides an excellent introduction to the interactive graphics package GGobi. GGobi can be connected to R with the rggobi package (Wickham et al., 2008).
There are a number of other graphics systems available in R: base graphics, grid graphics and trellis/lattice graphics. How does ggplot2 differ from them?
ggplot2, started in 2005, is an attempt to take the good things about base and lattice graphics and improve on them with a strong underlying model which supports the production of any kind of statistical graphic, based on the principles outlined above. The solid underlying model of ggplot2 makes it easy to describe a wide range of graphics with a compact syntax, and independent components make extension easy. Like lattice, ggplot2 uses grid to draw the graphics, which means you can exercise much low-level control over the appearance of the plot.
Work on ggvis, the successor to ggplot2, started in 2014. It takes the foundational ideas of ggplot2 but extends them to the web and interactive graphics. The syntax is similar, but it’s been re-designed from scratch to take advantage of what I’ve learned in the 10 years since creating ggplot2. The most exciting thing about ggvis is that it’s interactive and dynamic, so plots automatically re-draw themselves when the underlying data or plot specification changes. However, ggvis is a work in progress and can currently create only a fraction of the plots that ggplot2 can. Stay tuned for updates!
htmlwidgets, http://www.htmlwidgets.org, provides a common framework for accessing web visualisation tools from R. Packages built on top of htmlwidgets include leaflet (https://rstudio.github.io/leaflet/, maps), dygraphs (http://rstudio.github.io/dygraphs/, time series) and networkD3 (http://christophergandrud.github.io/networkD3/, networks). htmlwidgets is to ggvis what the many specialised graphic packages are to ggplot2: it provides graphics honed for specific purposes.
Many other R packages, such as vcd (Meyer et al., 2006), plotrix (Lemon et al., 2006) and gplots (Warnes, 2015), implement specialist graphics, but no others provide a framework for producing statistical graphics. A comprehensive list of all graphical tools available in other packages can be found in the graphics task view at http://cran.r-project.org/web/views/Graphics.html.
To use ggplot2, you must first install it. Make sure you have a recent version of R (at least version 3.2.0) from http://r-project.org and then run the following code to download and install ggplot2:
install.packages("ggplot2")
Data,
A set of aesthetic mappings between variables in the data and visual properties, and
At least one layer which describes how to render each observation. Layers are usually created with a geom function.
library(ggplot2)
library(dplyr)
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
Data: mpg.
Aesthetic mapping: engine size mapped to x position, fuel economy to y position.
Layer: points.
library(ggplot2)
library(dplyr)
ggplot(mpg, aes(model, manufacturer)) + geom_point()
library(ggplot2)
library(dplyr)
ggplot(mpg, aes(cty, hwy)) + geom_point()
library(ggplot2)
library(dplyr)
ggplot(diamonds, aes(carat, price)) + geom_point()
library(ggplot2)
library(dplyr)
ggplot(economics, aes(date, unemploy)) + geom_line()
library(ggplot2)
library(dplyr)
ggplot(mpg, aes(cty)) + geom_histogram()
library(ggplot2)
library(dplyr)
ggplot(mpg, aes(displ, cty, colour = class)) + geom_point()
library(ggplot2)
library(dplyr)
library(gridExtra)
g1<-ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
g2<-ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
grid.arrange(g1,g2,ncol=2)
Another technique for displaying additional categorical variables on a plot is facetting. Facetting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset. You’ll learn more about facetting in Sect. 7.2, but it’s such a useful technique that you need to know it right away.
There are two types of facetting: grid and wrapped. Wrapped is the most useful, so we’ll discuss it here, and you can learn about grid facetting later. To facet a plot you simply add a facetting specification with facet_wrap(), which takes the name of a variable preceded by ~.
library(ggplot2)
library(dplyr)
library(gridExtra)
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~class)
library(ggplot2)
library(dplyr)
library(gridExtra)
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(.~class)+geom_smooth(method="lm",se=FALSE)
library(ggplot2)
library(dplyr)
library(gridExtra)
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_grid(class~.)+geom_smooth(method="lm",se=FALSE)
library(ggplot2)
library(dplyr)
library(gridExtra)
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth()
This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals shown in grey. If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE). An important argument to geom_smooth() is the method, which allows you to choose which type of model is used to fit the smooth curve:
method = "loess", the default for small n, uses a smooth local regression (as described in ?loess). The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).
library(ggplot2)
library(dplyr)
library(gridExtra)
g1<-ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(span = 0.2)
g2<-ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(span = 1)
grid.arrange(g1,g2,ncol=2)
Loess does not work well for large datasets (it’s O(n^2) in memory), so an alternative smoothing algorithm is used when n is greater than 1000.
method = "gam" fits a generalised additive model provided by the mgcv package. You need to first load mgcv, then use a formula like formula = y ~ s(x) or y ~ s(x, bs = "cs") (for large data). This is what ggplot2 uses when there are more than 1000 points.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = "gam", formula = y ~ s(x))
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = "lm")
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
ggplot(mpg, aes(displ, hwy)) + geom_point() + geom_smooth(method = "rlm")
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
ggplot(mpg, aes(drv, hwy)) + geom_point()
Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.
Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.
Violin plots, geom_violin(), show a compact representation of the “density” of the distribution, highlighting the areas where more points are found.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(drv, hwy)) + geom_jitter()
g2<-ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
g3<-ggplot(mpg, aes(drv, hwy)) + geom_violin()
grid.arrange(g1,g2,g3,ncol=3)
Each method has its strengths and weaknesses. Boxplots summarise the bulk of the distribution with only five numbers, while jittered plots show every point but only work with relatively small datasets. Violin plots give the richest display, but rely on the calculation of a density estimate, which can be hard to interpret.
For jittered points, geom_jitter() offers the same control over aesthetics as geom_point(): size, colour, and shape. For geom_boxplot() and geom_violin(), you can control the outline colour or the internal fill colour.
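A minimal sketch of those aesthetic controls, using the same mpg variables as the plots above. The particular colours, shape and sizes are arbitrary choices made for illustration.

```r
library(ggplot2)

# Jittered points: size, colour and shape are set to constants here
p_jitter <- ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25, size = 1.5, colour = "steelblue", shape = 17)

# Boxplots and violins: colour sets the outline, fill sets the interior
p_box <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot(colour = "grey30", fill = "lightblue")
p_violin <- ggplot(mpg, aes(drv, hwy)) +
  geom_violin(colour = "grey30", fill = "lightblue")

p_jitter
p_box
p_violin
```

Because the values are given outside aes(), they are set as constants rather than mapped through a scale, so no legend is drawn.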
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(hwy)) + geom_histogram()
g2<-ggplot(mpg, aes(hwy)) + geom_freqpoly()
grid.arrange(g1,g2,ncol=2)
Both histograms and frequency polygons work in the same way: they bin the data, then count the number of observations in each bin. The only difference is the display: histograms use bars and frequency polygons use lines.
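One way to see the shared bin-then-count step is to look at the data each layer actually draws. The sketch below uses layer_data() from ggplot2 to pull out the binned counts; the bin width of 1 is an arbitrary choice for illustration.

```r
library(ggplot2)

hist_plot <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 1)
poly_plot <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 1)

# layer_data() returns a layer's data after its stat has been applied
hist_bins <- layer_data(hist_plot)
poly_bins <- layer_data(poly_plot)

# Each observation falls into exactly one bin, so both sets of counts
# add up to the number of rows in mpg
sum(hist_bins$count)
sum(poly_bins$count)
nrow(mpg)
```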
You can control the width of the bins with the binwidth argument (if you don’t want evenly spaced bins you can use the breaks argument). It is very important to experiment with the bin width. The default just splits your data into 30 bins, which is unlikely to be the best choice. You should always try many bin widths, and you may find you need multiple bin widths to tell the full story of your data.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 2.5)
g2<-ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 1)
grid.arrange(g1,g2,ncol=2)
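The breaks argument mentioned above takes the bin edges directly. The following is a small sketch with made-up, unevenly spaced edges; the edge values are assumptions chosen only so that they cover the range of hwy.

```r
library(ggplot2)

# Hypothetical, unevenly spaced bin edges (not from the text)
edges <- c(10, 15, 20, 22, 24, 26, 28, 30, 35, 45)

p_breaks <- ggplot(mpg, aes(hwy)) +
  geom_histogram(breaks = edges, colour = "white")

p_breaks

# One bin per pair of adjacent edges
nrow(layer_data(p_breaks))
```

With unequal bin widths the bar heights are no longer directly comparable as counts, so this kind of display needs to be read with care.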
An alternative to the frequency polygon is the density plot, geom_density(). I’m not a fan of density plots because they are harder to interpret since the underlying computations are more complex. They also make assumptions that are not true for all data, namely that the underlying distribution is continuous, unbounded, and smooth.
To compare the distributions of different subgroups, you can map a categorical variable to either fill (for geom_histogram()) or colour (for geom_freqpoly()). It’s easier to compare distributions using the frequency polygon because the underlying perceptual task is easier. You can also use facetting: this makes comparisons a little harder, but it’s easier to see the distribution of each group.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(displ, colour = drv)) + geom_freqpoly(binwidth = 0.5)
g2<-ggplot(mpg, aes(displ, fill = drv)) + geom_histogram(binwidth = 0.5) + facet_wrap(~drv, ncol = 1)
grid.arrange(g1,g2,ncol=2)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(displ, colour = drv)) + geom_density()
g2<-ggplot(mpg, aes(displ, colour = drv)) + geom_density() + facet_wrap(~drv, ncol = 1)
grid.arrange(g1,g2,ncol=2)
g1<-ggplot(mpg, aes(displ, fill = drv)) + geom_density()
g2<-ggplot(mpg, aes(displ, fill = drv)) + geom_density() + facet_wrap(~drv, ncol = 1)
grid.arrange(g1,g2,ncol=2)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
ggplot(mpg, aes(manufacturer)) + geom_bar()
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
drugs <- data.frame(
drug = c("a", "b", "c"),
effect = c(4.2, 9.7, 6.1)
)
g1<-ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")
g2<-ggplot(drugs, aes(drug, effect)) + geom_point()
grid.arrange(g1,g2,ncol=2)
Line and path plots are typically used for time series data. Line plots join the points from left to right, while path plots join them in the order that they appear in the dataset (in other words, a line plot is a path plot of the data sorted by x value). Line plots usually have time on the x-axis, showing how a single variable has changed over time. Path plots show how two variables have simultaneously changed over time, with time encoded in the way that observations are connected.
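The claim that a line plot is a path plot of the x-sorted data can be checked on a toy example. The data frame below is made up for illustration, with rows deliberately out of x order.

```r
library(ggplot2)

# Small made-up dataset whose rows are not sorted by x
df_order <- data.frame(x = c(3, 1, 2, 5, 4), y = c(2, 4, 1, 3, 5))

p_line <- ggplot(df_order, aes(x, y)) + geom_line()   # joins points left to right
p_path <- ggplot(df_order, aes(x, y)) + geom_path()   # joins points in row order
p_path_sorted <- ggplot(df_order[order(df_order$x), ], aes(x, y)) + geom_path()

# The line plot and the path plot of the sorted data visit the points
# in the same order; the unsorted path plot keeps the row order
layer_data(p_line)$x
layer_data(p_path_sorted)$x
layer_data(p_path)$x
```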
Because the year variable in the mpg dataset only has two values, we’ll show some time series plots using the economics dataset, which contains economic data on the US measured over the last 40 years. The figure below shows two plots of unemployment over time, both produced using geom_line(). The first shows the unemployment rate while the second shows the median number of weeks unemployed. We can already see some differences in these two variables, particularly in the last peak, where the unemployment percentage is lower than it was in the preceding peaks, but the length of unemployment is high.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(economics, aes(date, unemploy / pop)) + geom_line()
g2<-ggplot(economics, aes(date, uempmed)) + geom_line()
grid.arrange(g1,g2,ncol=2)
To examine this relationship in greater detail, we would like to draw both time series on the same plot. We could draw a scatterplot of unemployment rate vs. length of unemployment, but then we could no longer see the evolution over time. The solution is to join points adjacent in time with line segments, forming a path plot.
Below we plot unemployment rate vs. length of unemployment and join the individual observations with a path. Because of the many line crossings, the direction in which time flows isn’t easy to see in the first plot. In the second plot, we colour the points to make it easier to see the direction of time.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(economics, aes(unemploy / pop, uempmed)) + geom_path() + geom_point()
year <- function(x) as.POSIXlt(x)$year + 1900
g2<-ggplot(economics, aes(unemploy / pop, uempmed)) + geom_path(colour = "grey50") +
geom_point(aes(colour = year(date)))
grid.arrange(g1,g2,ncol=2)
We can see that unemployment rate and length of unemployment are highly correlated, but in recent years the length of unemployment has been increasing relative to the unemployment rate.
With longitudinal data, you often want to display multiple time series on each plot, each series representing one individual. To do this you need to map the group aesthetic to a variable encoding the group membership of each observation. This is explained in more depth in Sect. 3.5.
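As a brief sketch of the idea, the example below uses the Oxboys data from the nlme package (heights of 26 boys measured on several occasions), where Subject identifies the individual. Treat it as an illustration of mapping group, not a full treatment of grouping.

```r
library(ggplot2)
library(nlme)  # provides the Oxboys dataset

# One line per boy: group says which observations belong together
p_grouped <- ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line()

# Without group, all observations are joined into one sawtooth line
p_ungrouped <- ggplot(Oxboys, aes(age, height)) +
  geom_line()

p_grouped
p_ungrouped
```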
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3)
g2<-ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3) + xlab("city driving (mpg)") +
ylab("highway driving (mpg)")
# Remove the axis labels with NULL
g3<-ggplot(mpg, aes(cty, hwy)) + geom_point(alpha = 1 / 3) + xlab(NULL) + ylab(NULL)
grid.arrange(g1,g2,g3,ncol=3)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25)
g2<-ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25) + xlim("f", "r") + ylim(20, 30)
# For continuous scales, use NA to set only one limit
g3<-ggplot(mpg, aes(drv, hwy)) + geom_jitter(width = 0.25, na.rm = TRUE) + ylim(NA, 30)
grid.arrange(g1,g2,g3,ncol=3)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) + geom_point()
print(p)
summary(p)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
class [234x11]
mapping: x = ~displ, y = ~hwy, colour = ~factor(cyl)
faceting: <ggproto object: Class FacetNull, Facet, gg>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map_data: function
params: list
setup_data: function
setup_params: function
shrink: TRUE
train_scales: function
vars: function
super: <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity
Render it on screen with print(). This happens automatically when running interactively, but inside a loop or function, you’ll need to print() it yourself.
Save it to disk with ggsave(), described in Sect. 8.5.
# Save png to disk
ggsave("plot.png", width = 5, height = 5)
Briefly describe its structure with summary().
Save a cached copy of it to disk, with saveRDS(). This saves a complete copy of the plot object, so you can easily re-create it with readRDS().
saveRDS(p, "plot.rds")
q <- readRDS("plot.rds")
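The point about print() inside a loop is easy to miss, so here is a small sketch. The variable names in vars and the use of the .data pronoun to look up a column by name are choices made for this illustration.

```r
library(ggplot2)

vars <- c("cty", "hwy")  # columns of mpg to plot against displ
plots <- list()

for (v in vars) {
  p_v <- ggplot(mpg, aes(displ, .data[[v]])) +
    geom_point() +
    ylab(v)
  print(p_v)        # without this call nothing is drawn inside the loop
  plots[[v]] <- p_v # keep the plot object for later use
}
```

Storing the plots in a list means they can still be saved or modified after the loop has finished.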
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-qplot(displ, hwy, data = mpg)
g2<-qplot(displ, data = mpg)
grid.arrange(g1,g2,ncol=2)
Unless otherwise specified, qplot() tries to pick a sensible geometry and statistic based on the arguments provided. For example, if you give qplot() x and y variables, it’ll create a scatterplot. If you just give it an x, it’ll create a histogram or bar chart depending on the type of variable.
qplot() assumes that all variables should be scaled by default. If you want to set an aesthetic to a constant, you need to use I():
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-qplot(displ, hwy, data = mpg, colour = "blue")
g2<-qplot(displ, hwy, data = mpg, colour = I("blue"))
grid.arrange(g1,g2,ncol=2)
To display the data. We plot the raw data for many reasons, relying on our skills at pattern detection to spot gross structure, local structure, and outliers. This layer appears on virtually every graphic. In the earliest stages of data exploration, it is often the only layer.
To display a statistical summary of the data. As we develop and explore models of the data, it is useful to display model predictions in the context of the data. Showing the data helps us improve the model, and showing the model helps reveal subtleties of the data that we might otherwise miss. Summaries are usually drawn on top of the data.
To add additional metadata: context, annotations, and references. A metadata layer displays background context, annotations that help to give meaning to the raw data, or fixed references that aid comparisons across panels. Metadata can be useful in the background and foreground. A map is often used as a background layer with spatial data. Background metadata should be rendered so that it doesn’t interfere with your perception of the data, so is usually displayed underneath the data and formatted so that it is minimally perceptible. That is, if you concentrate on it, you can see it with ease, but it doesn’t jump out at you when you are casually browsing the plot.
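A single plot can combine all three purposes. The sketch below is one possible arrangement on the mpg data: a muted reference line as background metadata, the raw points as the data layer, and a fitted linear model as the summary. The choice of the mean of hwy as a reference is an assumption made for illustration.

```r
library(ggplot2)

p_layers <- ggplot(mpg, aes(displ, hwy)) +
  # Metadata: a muted reference line, added first so it sits under the data
  geom_hline(yintercept = mean(mpg$hwy), colour = "grey70", linetype = "dashed") +
  # Data: the raw observations
  geom_point() +
  # Summary: a fitted linear model drawn on top of the data
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_layers
```

Layers are drawn in the order they are added, which is what puts the reference line underneath and the summary on top.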
These geoms are the fundamental building blocks of ggplot2. They are useful in their own right, but are also used to construct more complex geoms. Most of these geoms are associated with a named plot: when that geom is used by itself in a plot, that plot has a special name.
Each of these geoms is two dimensional and requires both x and y aesthetics. All of them understand colour (or color) and size aesthetics, and the filled geoms (bar, tile and polygon) also understand fill.
geom_area() draws an area plot, which is a line plot filled to the y-axis (filled lines). Multiple groups will be stacked on top of each other.
geom_bar(stat = "identity") makes a bar chart. We need stat = "identity" because the default stat automatically counts values (so is essentially a 1d geom, see Sect. 3.11). The identity stat leaves the data unchanged. Multiple bars in the same location will be stacked on top of one another.
geom_line() makes a line plot. The group aesthetic determines which observations are connected; see Sect. 3.5 for more detail. geom_line() connects points from left to right; geom_path() is similar but connects points in the order they appear in the data. Both geom_line() and geom_path() also understand the aesthetic linetype, which maps a categorical variable to solid, dotted and dashed lines.
geom_point() produces a scatterplot. geom_point() also understands the shape aesthetic.
geom_polygon() draws polygons, which are filled paths. Each vertex of the polygon requires a separate row in the data. It is often useful to merge a data frame of polygon coordinates with the data just prior to plotting. Section 3.7 illustrates this concept in more detail for map data.
geom_rect(), geom_tile() and geom_raster() draw rectangles. geom_rect() is parameterised by the four corners of the rectangle, xmin, ymin, xmax and ymax. geom_tile() is exactly the same, but parameterised by the center of the rect and its size, x, y, width and height. geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL) + # Hide axis label
theme(plot.title = element_text(size = 12)) # Shrink plot title
g1<-p + geom_point() + ggtitle("point")
g2<-p + geom_text() + ggtitle("text")
g3<-p + geom_bar(stat = "identity") + ggtitle("bar")
g4<-p + geom_tile() + ggtitle("raster")
grid.arrange(g1,g2,g3,g4,ncol=4)
#some more plots
g5<-p + geom_line() + ggtitle("line")
g6<-p + geom_area() + ggtitle("area")
g7<-p + geom_path() + ggtitle("path")
g8<-p + geom_polygon() + ggtitle("polygon")
grid.arrange(g5,g6,g7,g8,ncol=4)
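The two rectangle parameterisations described earlier (corners versus centre and size) can be compared directly. This sketch uses two made-up rectangles and inspects the built layer data to show that both descriptions end up as the same corners.

```r
library(ggplot2)

# Two hypothetical rectangles described by their corners ...
corners <- data.frame(xmin = c(1, 4), xmax = c(3, 5), ymin = c(1, 2), ymax = c(5, 4))
# ... and the same rectangles described by centre and size
centres <- data.frame(x = c(2, 4.5), y = c(3, 3), width = c(2, 1), height = c(4, 2))

p_rect <- ggplot(corners) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax))
p_tile <- ggplot(centres) +
  geom_tile(aes(x, y, width = width, height = height))

# After building, both layers carry the same corner coordinates
rect_corners <- layer_data(p_rect)[, c("xmin", "xmax", "ymin", "ymax")]
tile_corners <- layer_data(p_tile)[, c("xmin", "xmax", "ymin", "ymax")]
rect_corners
tile_corners
```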
Adding text to a plot can be quite tricky. ggplot2 doesn’t have all the answers, but does provide some tools to make your life a little easier. The main tool is geom_text(), which adds labels at the specified x and y positions. geom_text() has the most aesthetics of any geom, because there are so many ways to control the appearance of a text:
family gives the name of a font. There are only three fonts that are guaranteed to work everywhere: "sans" (the default), "serif", or "mono":
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
df <- data.frame(x = 1, y = 3:1, family = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) + geom_text(aes(label = family, family = family))
showtext, https://github.com/yixuan/showtext, by Yixuan Qiu, makes GD-independent plots by rendering all text as polygons.
extrafont, https://github.com/wch/extrafont, by Winston Chang, converts fonts to a standard format that all devices can use.
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
df <- data.frame(x = 1, y = 3:1, face = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) + geom_text(aes(label = face, fontface = face))
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
df <- data.frame(
x = c(1, 1, 2, 2, 1.5),
y = c(1, 2, 1, 2, 1.5),
text = c(
"bottom-left", "bottom-right",
"top-left", "top-right", "center"
)
)
g1<-ggplot(df, aes(x, y)) + geom_text(aes(label = text))
g2<-ggplot(df, aes(x, y)) + geom_text(aes(label = text), vjust = "inward", hjust = "inward")
grid.arrange(g1,g2,ncol=2)
size controls the font size. Unlike most tools, ggplot2 uses mm, rather than the usual points (pts). This makes it consistent with other size units in ggplot2. (There are 72.27 pts in an inch and 25.4 mm in an inch, so to convert from points to mm, just multiply by 25.4/72.27.)
angle specifies the rotation of the text in degrees.
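Since text size is given in mm, a small helper can make point sizes easier to work with. The sketch below assumes 72.27 points and 25.4 mm per inch; pt_to_mm() is a hypothetical helper written for this example, while .pt is a constant exported by ggplot2 that holds the ratio 72.27 / 25.4.

```r
library(ggplot2)

# Hypothetical helper: convert a font size in points to mm
pt_to_mm <- function(pt) pt * 25.4 / 72.27

df_size <- data.frame(x = 1, y = 1:2, size_pt = c(10, 20))
df_size$label <- paste(df_size$size_pt, "pt")

p_size <- ggplot(df_size, aes(x, y)) +
  geom_text(aes(label = label, size = pt_to_mm(size_pt))) +
  scale_size_identity()  # use the computed mm values as they are

p_size

# Dividing by .pt gives the same conversion
pt_to_mm(12)
12 / .pt
```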
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) +
geom_point() +
geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) +
xlim(1, 3.6)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
g1<-ggplot(mpg, aes(displ, hwy)) + geom_text(aes(label = model)) + xlim(1, 8)
g2<-ggplot(mpg, aes(displ, hwy)) + geom_text(aes(label = model), check_overlap = TRUE) + xlim(1, 8)
grid.arrange(g1,g2,ncol=2)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
label <- data.frame(
waiting = c(55, 80),
eruptions = c(2, 4.3),
label = c("peak one", "peak two")
)
ggplot(faithfuld, aes(waiting, eruptions)) +
geom_tile(aes(fill = density)) +
geom_label(data = label, aes(label = label))
Text does not affect the limits of the plot. Unfortunately there’s no way to make this work since a label has an absolute size (e.g. 3 cm), regardless of the size of the plot. This means that the limits of a plot would need to be different depending on the size of the plot—there’s just no way to make that happen with ggplot2. Instead, you’ll need to tweak xlim() and ylim() based on your data and plot size.
If you want to label many points, it is difficult to avoid overlaps. check_overlap = TRUE is useful, but offers little control over which labels are removed. There are a number of techniques available for base graphics, like maptools::pointLabel(), but they’re not trivial to port to the grid graphics used by ggplot2. If all else fails, you may need to manually label points in a drawing tool.
Text labels can also serve as an alternative to a legend. This usually makes the plot easier to read because it puts the labels closer to the data. The directlabels (https://github.com/tdhock/directlabels) package, by Toby Dylan Hocking, provides a number of tools to make this easier:
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
library(directlabels)
g1<-ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
g2<-ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point(show.legend = FALSE) +
directlabels::geom_dl(aes(label = class), method = "smart.grid")
grid.arrange(g1,g2,ncol=2)
geom_text() to add text descriptions or to label points. Most plots will not benefit from adding text to every single observation on the plot, but labelling outliers and other important points is very useful.
geom_rect() to highlight interesting rectangular regions of the plot. geom_rect() has aesthetics xmin, xmax, ymin and ymax.
geom_line(), geom_path() and geom_segment() to add lines. All these geoms have an arrow parameter, which allows you to place an arrowhead on the line. Create arrowheads with arrow(), which has arguments angle, length, ends and type.
geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes called rules), that span the full range of the plot.
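A short sketch of two of these tools together: a horizontal reference line and an arrow built with arrow(). The coordinates and label text are made up for illustration, and the segment is added through annotate() so that a single arrow is drawn rather than one per row of the data.

```r
library(ggplot2)

p_annot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # Reference line spanning the full width of the plot
  geom_hline(yintercept = 30, colour = "grey50", linetype = "dashed") +
  # A single arrow; angle, length, ends and type control the arrowhead
  annotate(
    "segment", x = 6.5, y = 40, xend = 5.4, yend = 31,
    arrow = arrow(angle = 20, length = unit(0.1, "inches"),
                  ends = "last", type = "closed")
  ) +
  annotate("text", x = 6.5, y = 41, label = "hwy above 30", vjust = 0)

p_annot
```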
Typically, you can either put annotations in the foreground (using alpha if needed so you can still see the data), or in the background. With the default background, a thick white line makes a useful reference: it’s easy to see but it doesn’t jump out at you.
To show off the basic idea, we’ll draw a time series of unemployment:
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
ggplot(economics, aes(date, unemploy)) + geom_line()
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
presidential <- subset(presidential, start > economics$date[1])
ggplot(economics) +
geom_rect(
aes(xmin = start, xmax = end, fill = party),
ymin = -Inf, ymax = Inf, alpha = 0.2,
data = presidential
) +
geom_vline(
aes(xintercept = as.numeric(start)),
data = presidential,
colour = "grey50", alpha = 0.5
) +
geom_text(
aes(x = start, y = 2500, label = name),
data = presidential,
size = 3, vjust = 0, hjust = 0, nudge_x = 50
) +
geom_line(aes(date, unemploy)) +
scale_fill_manual(values = c("blue", "red"))
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have
varied a lot over the years", 40), collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
geom_line() +
geom_text(
aes(x, y, label = caption),
data = data.frame(x = xrng[1], y = yrng[2], caption = caption),
hjust = 0, vjust = 1, size = 4
)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
ggplot(economics, aes(date, unemploy)) +
geom_line() +
annotate("text", x = xrng[1], y = yrng[2], label = caption,
hjust = 0, vjust = 1, size = 4
)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(mgcv)
library(MASS)
ggplot(diamonds, aes(log10(carat), log10(price))) +
  geom_bin2d() +
  facet_wrap(~cut, nrow = 1)
mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) +
  geom_bin2d() +
  geom_abline(intercept = mod_coef[1], slope = mod_coef[2],
    colour = "white", size = 1) +
  facet_wrap(~cut, nrow = 1)
Geoms can be roughly divided into individual and collective geoms. An individual geom draws a distinct graphical object for each observation (row). For example, the point geom draws one point per row. A collective geom displays multiple observations with one geometric object. This may be a result of a statistical summary, like a boxplot, or may be fundamental to the display of the geom, like a polygon. Lines and paths fall somewhere in between: each line is composed of a set of straight segments, but each segment represents two points. How do we control the assignment of observations to graphical elements? This is the job of the group aesthetic.
By default, the group aesthetic is mapped to the interaction of all discrete variables in the plot. This often partitions the data correctly, but when it does not, or when no discrete variable is used in a plot, you’ll need to explicitly define the grouping structure by mapping group to a variable that has a different value for each group.
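The default grouping can be inspected directly. The sketch below uses the built-in mpg data (an illustrative choice, not the dataset used in the rest of this section) and layer_data() to count how many groups ggplot2 creates with no discrete variable, with one discrete variable, and with an explicit group mapping.

```r
library(ggplot2)

# No discrete variable: every row falls into one constant group
p_none <- ggplot(mpg, aes(displ, hwy)) +
  geom_line()
groups_none <- unique(layer_data(p_none)$group)

# One discrete variable (drv): one group per level
p_drv <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_line()
groups_drv <- unique(layer_data(p_drv)$group)

# An explicit group mapping overrides the default
p_both <- ggplot(mpg, aes(displ, hwy, group = interaction(drv, year))) +
  geom_line()
groups_both <- unique(layer_data(p_both)$group)

length(groups_none)
length(groups_drv)
length(groups_both)
```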
There are three common cases where the default is not enough, and we will consider each one below. In the following examples, we will use a simple longitudinal dataset, Oxboys, from the nlme package. It records the heights (height) and centered ages (age) of 26 boys (Subject), measured on nine occasions (Occasion). Subject and Occasion are stored as ordered factors.
library(nlme)
data(Oxboys)
head(Oxboys)
Grouped Data: height ~ age | Subject
  Subject     age height Occasion
1       1 -1.0000  140.5        1
2       1 -0.7479  143.4        2
3       1 -0.4630  144.8        3
4       1 -0.1643  147.1        4
5       1 -0.0027  147.7        5
6       1  0.2466  150.2        6
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_point() +
geom_line()
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line()
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_line() +
geom_smooth(method = "lm", se = FALSE)
This is not what we wanted; we have inadvertently added a smoothed line for each boy. Grouping controls both the display of the geoms, and the operation of the stats: one statistical transformation is run for each group.
Instead of setting the grouping aesthetic in ggplot(), where it will apply to all layers, we set it in geom_line() so it applies only to the lines. There are no discrete variables in the plot so the default grouping variable will be a constant and we get one smooth:
ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group = Subject)) +
geom_smooth(method = "lm", size = 2, se = FALSE)
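To confirm that the stat runs once per group, we can count the groups in the smooth layer under both specifications. This is only a sketch: the object names are ours, and formula = y ~ x is spelled out purely to silence the message geom_smooth() otherwise prints.

```r
library(ggplot2)
library(nlme)  # for the Oxboys data

# Grouping set in ggplot(): inherited by every layer
p_global <- ggplot(Oxboys, aes(age, height, group = Subject)) +
  geom_line() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Grouping set only in geom_line(): the smooth sees no grouping
p_layer <- ggplot(Oxboys, aes(age, height)) +
  geom_line(aes(group = Subject)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth in both plots
n_fits_global <- length(unique(layer_data(p_global, 2)$group))
n_fits_layer <- length(unique(layer_data(p_layer, 2)$group))
c(global = n_fits_global, layer = n_fits_layer)
```

The first count should match the number of boys, the second should be one.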
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot()
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(colour = "#3366FF", alpha = 0.5)
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(aes(group = Subject), colour = "#3366FF", alpha = 0.5)
A final important issue with collective geoms is how the aesthetics of the individual observations are mapped to the aesthetics of the complete entity. What happens when different aesthetics are mapped to a single geometric element?
Lines and paths operate on an off-by-one principle: there is one more observation than line segment, and so the aesthetic for the first observation is used for the first segment, the second observation for the second segment and so on. This means that the aesthetic for the last observation is not used:
df <- data.frame(x = 1:3, y = 1:3, colour = c(1,3,5))
ggplot(df, aes(x, y, colour = factor(colour))) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
ggplot(df, aes(x, y, colour = colour)) +
geom_line(aes(group = 1), size = 2) +
geom_point(size = 5)
xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(
x = xgrid,
y = approx(df$x, df$y, xout = xgrid)$y,
colour = approx(df$x, df$colour, xout = xgrid)$y
)
ggplot(interp, aes(x, y, colour = colour)) +
geom_line(size = 2) +
geom_point(data = df, size = 5)
An additional limitation for paths and lines is that line type must be constant over each individual line. In R there is no way to draw a line which has varying line type.
For all other collective geoms, like polygons, the aesthetics from the individual components are only used if they are all the same, otherwise the default value is used. It’s particularly clear why this makes sense for fill: how would you colour a polygon that had a different fill colour for each point on its border?
These issues are most relevant when mapping aesthetics to continuous variables, because, as described above, when you introduce a mapping to a discrete variable, it will by default split apart collective geoms into smaller pieces. This works particularly well for bar and area plots, because stacking the individual pieces produces the same shape as the original ungrouped data:
g1<-ggplot(mpg, aes(class)) +
geom_bar()
g2<-ggplot(mpg, aes(class, fill = drv)) +
geom_bar()
grid.arrange(g1,g2,ncol=2)
g1<-ggplot(mpg, aes(class, fill = hwy)) +
geom_bar()
g2<-ggplot(mpg, aes(class, fill = hwy, group = hwy)) +
geom_bar()
grid.arrange(g1,g2,ncol=2)
The bars will be stacked in the order defined by the grouping variable. If you need fine control, you’ll need to create a factor with levels ordered as needed.
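A minimal sketch of that kind of control, assuming we want the drv levels in the order r, f, 4 (an arbitrary ordering picked for illustration): convert the variable to a factor with explicit levels before plotting.

```r
library(ggplot2)

mpg_ordered <- mpg
# Hypothetical ordering, chosen only for illustration
mpg_ordered$drv <- factor(mpg_ordered$drv, levels = c("r", "f", "4"))

p_stack <- ggplot(mpg_ordered, aes(class, fill = drv)) +
  geom_bar()
p_stack
```

Both the legend and the stacking follow the factor levels, so reordering the levels is usually enough to rearrange the bars.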
Modify the following plot so that you get one boxplot per integer value of displ.
g1<-ggplot(mpg, aes(displ, cty)) + geom_boxplot()
g2<-ggplot(mpg, aes(factor(displ), cty)) + geom_boxplot()
grid.arrange(g1,g2,ncol=2)
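Note that factor(displ) gives one boxplot per distinct value of displ rather than per integer. One possible reading of the exercise, sketched here under the assumption that "integer value" means displ rounded to the nearest whole number, is to keep displ continuous and set the group aesthetic:

```r
library(ggplot2)

# Group observations by displ rounded to the nearest integer
p_int <- ggplot(mpg, aes(displ, cty, group = round(displ))) +
  geom_boxplot()
n_boxes <- nrow(layer_data(p_int))
p_int
```

Using floor(displ) instead of round(displ) would be an equally defensible choice; the point is that the grouping, not the x variable, decides how many boxplots are drawn.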
library(plotly)
g1<-ggplot(faithfuld, aes(eruptions, waiting)) + geom_contour(aes(z = density, colour = ..level..))
ggplotly(g1)
g2<-ggplot(faithfuld, aes(eruptions, waiting)) + geom_raster(aes(fill = density))
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
# Bubble plots work better with fewer observations
small <- faithfuld[seq(1, nrow(faithfuld), by = 10), ]
ggplot(small, aes(eruptions, waiting)) +
geom_point(aes(size = density), alpha = 1/3) +
scale_size_area()
lat and long, giving the location of a point.
group, a unique identifier for each contiguous region.
id, the name of the region.
Separate group and id variables are necessary because sometimes a geographical unit isn’t a contiguous polygon. For example, Hawaii is composed of multiple islands that can’t be drawn using a single polygon.
The following code extracts that data from the built-in maps package using ggplot2::map_data(). The maps package isn’t particularly accurate or up-to-date, but it’s built into R so it’s a reasonable place to start.
NB: Could not complete this part because some of the packages are not available for version 3.6.1.
Discrete x, range: geom_errorbar(), geom_linerange().
Discrete x, range & center: geom_crossbar(), geom_pointrange().
Continuous x, range: geom_ribbon().
Continuous x, range & center: geom_smooth(stat = "identity").
y <- c(18, 11, 16)
df <- data.frame(x = 1:3, y = y, se = c(1.2, 0.5, 1.0))
base <- ggplot(df, aes(x, y, ymin = y - se, ymax = y + se))
g1<-base + geom_crossbar()
g2<-base + geom_pointrange()
g3<-base + geom_smooth(stat = "identity")
g4<-base + geom_errorbar()
g5<-base + geom_linerange()
g6<-base + geom_ribbon()
grid.arrange(g1,g2,g3,g4,g5,g6,ncol=3,nrow=2)
When you have aggregated data where each row in the dataset represents multiple observations, you need some way to take into account the weighting variable. We will use some data collected on Midwest states in the 2000 US census in the built-in midwest data frame. The data consists mainly of percentages (e.g., percent white, percent below poverty line, percent with college degree) and some information for each county (area, total population, population density).
There are a few different things we might want to weight by:
Nothing, to look at numbers of counties.
Total population, to work with absolute numbers.
Area, to investigate geographic effects. (This isn’t useful for midwest, but would be if we had variables like percentage of farmland.)
# Unweighted
g1<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point()
# Weight by population
g2<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
scale_size_area("Population\n(millions)", breaks = c(0.5, 1, 2, 4))
grid.arrange(g1,g2,ncol=2)
# Unweighted
g1<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point() +
geom_smooth(method = lm, size = 1)
# Weighted by population
g2<-ggplot(midwest, aes(percwhite, percbelowpoverty)) +
geom_point(aes(size = poptotal / 1e6)) +
geom_smooth(aes(weight = poptotal), method = lm, size = 1) +
scale_size_area(guide = "none")
grid.arrange(g1,g2,ncol=2)
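To see what the weight aesthetic changes numerically, we can fit the two linear models by hand. This sketch assumes that geom_smooth(method = lm) hands the weight aesthetic to lm() as its weights argument; the object names are ours.

```r
library(ggplot2)  # for the midwest data

# Each county counts equally
fit_unweighted <- lm(percbelowpoverty ~ percwhite, data = midwest)

# Each county counts in proportion to its population
fit_weighted <- lm(percbelowpoverty ~ percwhite, data = midwest,
                   weights = poptotal)

coef(fit_unweighted)
coef(fit_weighted)
```

Comparing the two sets of coefficients shows how much the populous counties pull the fitted line.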
g1 <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1) +
  ylab("Counties")
# Divide by 1000 so the bar heights match the axis label
g2 <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal / 1000), binwidth = 1) +
  ylab("Population (1000s)")
grid.arrange(g1, g2, ncol = 2)
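A quick way to check what the weighted histogram is counting is to sum the bar heights. In this sketch (object names are ours) the unweighted bars should add up to the number of counties and the bars weighted by raw poptotal to the total population.

```r
library(ggplot2)

p_count <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(binwidth = 1)
p_pop <- ggplot(midwest, aes(percbelowpoverty)) +
  geom_histogram(aes(weight = poptotal), binwidth = 1)

# Sum of bar heights in each plot
total_counties <- sum(layer_data(p_count)$count)
total_population <- sum(layer_data(p_pop)$count)

c(counties = total_counties, population = total_population)
```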
To demonstrate tools for large datasets, we’ll use the built-in diamonds dataset, which consists of price and quality information for ~54,000 diamonds.
The data contains the four C’s of diamond quality: carat, cut, colour and clarity; and five physical measurements: depth, table, x, y and z.
The dataset has not been well cleaned, so as well as demonstrating interesting facts about diamonds, it also shows some data quality problems.
There are a number of geoms that can be used to display distributions, depending on the dimensionality of the distribution, whether it is continuous or discrete, and whether you are interested in the conditional or joint distribution.
For 1d continuous distributions the most important geom is the histogram, geom_histogram():
data(diamonds)
g1<-ggplot(diamonds, aes(depth)) + geom_histogram()
g2<-ggplot(diamonds, aes(depth)) + geom_histogram(binwidth = 0.1) + xlim(55, 70)
grid.arrange(g1,g2,ncol=2)
It is important to experiment with binning to find a revealing view. You can change the binwidth, specify the number of bins, or specify the exact location of the breaks. Never rely on the default parameters to get a revealing view of the distribution. Zooming in on the x axis, xlim(55, 70), and selecting a smaller bin width, binwidth = 0.1, reveals far more detail.
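The three ways of controlling the bins can be sketched as follows. The particular values (0.5, 60, and breaks every 0.25 between 55 and 70) are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

base_depth <- ggplot(diamonds, aes(depth))

# Fixed bin width
h_width <- base_depth + geom_histogram(binwidth = 0.5)

# Fixed number of bins
h_bins <- base_depth + geom_histogram(bins = 60)

# Exact break locations; values outside the breaks are not counted
h_breaks <- base_depth + geom_histogram(breaks = seq(55, 70, by = 0.25))
```

Printing each plot in turn is a quick way to compare how much detail each specification reveals.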
When publishing figures, don’t forget to include information about important parameters (like bin width) in the caption.
If you want to compare the distribution between groups, you have a few options:
Show small multiples of the histogram, facet_wrap(~ var).
Use colour and a frequency polygon, geom_freqpoly().
Use a “conditional density plot”, geom_histogram(position = "fill").
g1<-ggplot(diamonds, aes(depth)) +
geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
g2<-ggplot(diamonds, aes(depth)) +
geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
grid.arrange(g1,g2,ncol=2)
(I’ve suppressed the legends to focus on the display of the data.)
Both the histogram and frequency polygon geoms use the same underlying statistical transformation: stat = "bin". This statistic produces two output variables: count and density. By default, count is mapped to y-position, because it’s most interpretable. The density is the count divided by the product of the total count and the bin width, and is useful when you want to compare the shape of the distributions, not the overall size.
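The relationship between count and density can be checked by hand. This sketch pulls the computed bins out of a histogram with layer_data() and recomputes the density from the counts; the object names are ours.

```r
library(ggplot2)

p_bin <- ggplot(diamonds, aes(depth)) +
  geom_histogram(binwidth = 0.1)
bins <- layer_data(p_bin)

# density = count / (total count * bin width)
width <- bins$xmax - bins$xmin
density_by_hand <- bins$count / (sum(bins$count) * width)

all.equal(bins$density, density_by_hand)
```

Because of this scaling, the bar areas (density times width) sum to one, which is what makes densities comparable across groups of different size.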
An alternative to a bin-based visualisation is a density estimate. geom_density() places a little normal distribution at each data point and sums up all the curves. It has desirable theoretical properties, but is more difficult to relate back to the data. Use a density plot when you know that the underlying density is smooth, continuous and unbounded. You can use the adjust parameter to make the density more or less smooth.
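As a small sketch of the adjust parameter: it multiplies the default bandwidth, so values below 1 give a rougher estimate and values above 1 a smoother one. The values 1/2 and 2 are arbitrary.

```r
library(ggplot2)

base_depth <- ggplot(diamonds, aes(depth)) +
  xlim(58, 68)

# Half the default bandwidth: more wiggly
d_rough <- base_depth + geom_density(adjust = 1 / 2, na.rm = TRUE)

# Default bandwidth
d_default <- base_depth + geom_density(na.rm = TRUE)

# Twice the default bandwidth: smoother
d_smooth <- base_depth + geom_density(adjust = 2, na.rm = TRUE)
```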
g1<-ggplot(diamonds, aes(depth)) +
geom_density(na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
g2<-ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.2, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
grid.arrange(g1,g2,ncol=2)
Note that the area of each density estimate is standardised to one so that you lose information about the relative size of each group.
The histogram, frequency polygon and density display a detailed view of the distribution. However, sometimes you want to compare many distributions, and it’s useful to have alternative options that sacrifice quality for quantity. Here are three options:
g1<-ggplot(diamonds, aes(clarity, depth)) +
geom_boxplot()
g2<-ggplot(diamonds, aes(carat, depth)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
grid.arrange(g1,g2,ncol=2)
g1<-ggplot(diamonds, aes(clarity, depth)) +
geom_violin()
g2<-ggplot(diamonds, aes(carat, depth)) +
geom_violin(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
grid.arrange(g1,g2,ncol=2)
The scatterplot is a very important tool for assessing the relationship between two continuous variables. However, when the data is large, points will often be plotted on top of each other, obscuring the true relationship. In extreme cases, you will only be able to see the extent of the data, and any conclusions drawn from the graphic will be suspect. This problem is called overplotting.
There are a number of ways to deal with it depending on the size of the data and severity of the overplotting. The first set of techniques involves tweaking aesthetic properties. These tend to be most effective for smaller datasets:
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
g1<-norm + geom_point()
g2<-norm + geom_point(shape = 1) # Hollow circles
g3<-norm + geom_point(shape = ".") # Pixel sized
grid.arrange(g1,g2,g3,ncol=3)
g1<-norm + geom_point(alpha = 1 / 3)
g2<-norm + geom_point(alpha = 1 / 5)
g3<-norm + geom_point(alpha = 1 / 10)
grid.arrange(g1,g2,g3,ncol=3)
g1<-norm + geom_bin2d()
g2<-norm + geom_bin2d(bins = 10)
g3<-norm + geom_hex()
g4<-norm + geom_hex(bins = 10)
grid.arrange(g1,g2,g3,g4,ncol=2,nrow=2)
Estimate the 2d density with stat_density2d(), and then display using one of the techniques for showing 3d surfaces in Sect. 3.6.
If you are interested in the conditional distribution of y given x, then the techniques of Sect. 2.6.3 will also be useful.
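A minimal sketch of the 2d density approach, using geom_density_2d() (the geom that pairs with the 2d density stat) to overlay contours on faded points. The simulated data mirrors the earlier overplotting examples; the seed is arbitrary and only there to make the sketch reproducible.

```r
library(ggplot2)

set.seed(1014)  # arbitrary seed
df_norm <- data.frame(x = rnorm(2000), y = rnorm(2000))

p_dens <- ggplot(df_norm, aes(x, y)) +
  geom_point(alpha = 1 / 10) +
  geom_density_2d()  # contours of the estimated 2d density

# Layer 2 holds the contour paths and their density levels
contours <- layer_data(p_dens, 2)
p_dens
```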
geom_histogram() and geom_bin2d() use familiar geoms, geom_bar() and geom_raster(), combined with new statistical transformations, stat_bin() and stat_bin2d(). stat_bin() and stat_bin2d() combine the data into bins and count the number of observations in each bin. But what if we want a summary other than count? So far, we’ve just used the default statistical transformation associated with each geom. Now we’re going to explore how to use stat_summary_bin() and stat_summary_2d() to compute different summaries.
Let’s start with a couple of examples with the diamonds data. The first example in each pair shows how we can count the number of diamonds in each bin; the second shows how we can compute the average price.
g1<-ggplot(diamonds, aes(color)) + geom_bar()
g2<-ggplot(diamonds, aes(color, price)) + geom_bar(stat = "summary_bin", fun.y = mean)
grid.arrange(g1,g2,ncol=2)
g1<-ggplot(diamonds, aes(table, depth)) +
geom_bin2d(binwidth = 1, na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
g2<-ggplot(diamonds, aes(table, depth, z = price)) +
geom_raster(binwidth = 1, stat = "summary_2d", fun = mean,
na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
grid.arrange(g1,g2,ncol=2)
To get more help on the arguments associated with the two transformations, look at the help for stat_summary_bin() and stat_summary_2d(). You can control the size of the bins and the summary functions. stat_summary_bin() can produce y, ymin and ymax aesthetics, also making it useful for displaying measures of spread. See the docs for more details. You’ll learn more about how geoms and stats interact in Sect. 5.6.
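As a sketch of the spread use case, the following bins carat into 10 bins and summarises price in each bin with mean_se(), which returns a mean plus and minus one standard error as y, ymin and ymax. The choice of variables, bin count and geom is ours.

```r
library(ggplot2)

p_spread <- ggplot(diamonds, aes(carat, price)) +
  stat_summary_bin(fun.data = mean_se, bins = 10, geom = "pointrange")

# One row per non-empty bin, with y, ymin and ymax computed by mean_se()
spread <- layer_data(p_spread)
p_spread
```

Bins containing a single diamond have no standard error, so their ranges may be dropped with a warning when the plot is drawn.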
These summary functions are quite constrained but are often useful for a quick first pass at a problem. If you find them restraining, you’ll need to do the summaries yourself. See Sect. 10.4 for more details.
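Doing the summary yourself can be as simple as aggregating first and plotting the result. This sketch uses base R's aggregate() to keep it self-contained; that is our assumption, and other tools would work equally well.

```r
library(ggplot2)

# Mean price per colour grade, computed outside ggplot2
price_by_color <- aggregate(price ~ color, data = diamonds, FUN = mean)

# Plot the precomputed values as they are
p_manual <- ggplot(price_by_color, aes(color, price)) +
  geom_bar(stat = "identity")
p_manual
```

Once the summary lives in its own data frame, any function can be used to compute it, not only those the summary stats accept.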
animInt, https://github.com/tdhock/animint, lets you make your ggplot2 graphics interactive, adding querying, filtering and linking.
GGally, https://github.com/ggobi/ggally, provides a very flexible scatterplot matrix, amongst other tools.
ggbio, http://www.tengfei.name/ggbio/, provides a number of specialised geoms for genomic data.
ggdendro, https://github.com/andrie/ggdendro, turns data from tree methods into data frames that can easily be displayed with ggplot2.
ggfortify, https://github.com/sinhrks/ggfortify, provides fortify and autoplot methods to handle objects from some popular R packages.
ggenealogy, https://cran.r-project.org/package=ggenealogy, helps explore and visualise genealogy data.
ggmcmc, http://xavier-fim.net/packages/ggmcmc/, provides a set of flexible tools for visualising the samples generated by MCMC methods.
ggparallel, https://cran.r-project.org/package=ggparallel: easily draw parallel coordinates plots, and the closely related hammock and common angle plots.
ggtern, http://www.ggtern.com, lets you use ggplot2 to draw ternary diagrams, used when you have three variables that always sum to one.
ggtree, https://github.com/GuangchuangYu/ggtree, provides tools to view and annotate phylogenetic trees with different types of meta-data.
granovaGG, https://github.com/briandk/granovaGG, provides tools to visualise ANOVA results.
plotluck, https://github.com/stefan-schroedl/plotluck: the ggplot2 version of Google’s “I’m feeling lucky”. It automatically creates plots for one, two or three variables.